\chapter{Hart to encoder interface} \label{Interface}

\section{Instruction Trace Interface requirements}\label{sec:InstructionInterfaceRequirements}

This section describes in general terms the information which must be
passed from the RISC-V hart to the trace encoder for the purposes of
Instruction Trace, and distinguishes between what is mandatory, and
what is optional.

The following information is mandatory:

\begin{itemize}
  \item The number of instructions that are being retired;
  \item Whether there has been an exception or interrupt, and if so the cause 
        (from the \textbf{\textit{ucause/scause/vscause/mcause}} CSR)
        and trap value (from the \textbf{\textit{utval/stval/vstval/mtval}} CSR).

The register set to output should be the set that is updated as a result of the exception (i.e. 
the set associated with the privilege level immediately following the exception);

  \item The current privilege level of the RISC-V hart;
  \item The \textit{instruction\_type} of retired instructions for:
    \begin{itemize}
      \item Jumps with a target that cannot be inferred from the source code;
      \item Taken and nontaken branches;
      \item Return from exception or interrupt (\textbf{\textit{*ret}} instructions).
    \end{itemize}
  \item The \textit{instruction\_address} for:
    \begin{itemize}
      \item Jumps with a target that \textit{cannot} be inferred from the source code;
      \item The instruction retired immediately after a jump with a target that \textit{cannot} be inferred 
        from the source code (also referred to as the target or destination of the jump);
      \item Taken and nontaken branches;
      \item The last instruction retired before an exception or interrupt;
      \item The first instruction retired following an exception or interrupt;
      \item The last instruction retired before a privilege change;
      \item The first instruction retired following a privilege change.
    \end{itemize}
\end{itemize}

The following information is optional:

\begin{itemize}
  \item Context or Time information:
    \begin{itemize}
      \item The context and/or hart ID and/or time;
      \item The type of action to take when context or time data changes.
    \end{itemize}
  \item The \textit{instruction\_type} of instructions for:
    \begin{itemize}
      \item Calls with a target that \textit{cannot} be inferred from the source code;
      \item Calls with a target that \textit{can} be inferred from the source code;
      \item Tail-calls with a target that \textit{cannot} be inferred from the source code;
      \item Tail-calls with a target that \textit{can} be inferred from the source code;
      \item Returns with a target that \textit{cannot} be inferred from the source code;
      \item Returns with a target that \textit{can} be inferred from the source code;
      \item Co-routine swap;
      \item Jumps which don't fit any of the above classifications with a target that \textit{cannot} be inferred from the source code;
      \item Jumps which don't fit any of the above classifications with a target that \textit{can} be inferred from the source code.
    \end{itemize}
  \item If context or time is supported then the \textit{instruction\_address} for:
    \begin{itemize}
      \item The last instruction retired before a context or a time change;
      \item The first instruction retired following a context or time change.
    \end{itemize}
  \item Whether jump targets are sequentially inferable or not.
\end{itemize}

The mandatory information is the bare-minimum required to implement the branch trace algorithm outlined in Chapter~\ref{Algorithm}.  
The optional information facilitates alternative or improved trace algorithms:

\begin{itemize}
  \item Implicit return mode (see Section~\ref{sec:implicit-return}) requires the encoder to keep track of the number of nested 
    function calls, and to do this it must be aware of all calls and returns regardless of whether the target can be inferred or not;
  \item A simpler algorithm useful for basic code profiling would only report function calls and returns, again 
    regardless of whether the target can be inferred or not;
  \item Branch prediction techniques can be used to further improve the encoder efficiency, particularly for loops 
    (see Section~\ref{sec:branch-prediction}).  This requires the encoder to be aware of the address of all branches, whether
    they are taken or not.
  \item Uninferable jumps can be treated as inferable (which don't need to be reported in the trace output) if both the jump 
    and the preceding instruction which loads the target into a register have been traced.
\end{itemize}

\subsection{Jump classification and target inference} \label{Jump Classes}

Jumps are classified as \textit{inferable}, or \textit{uninferable}.  An \textit{inferable} jump has a target which can be
deduced from the binary executable or representation thereof (e.g. ELF).  For the purposes of this specification, the following 
strict definition applies:

If the target of a jump is supplied via a constant embedded within the jump opcode, it is classified as \textit{inferable}.
Jumps which are not \textit{inferable} are by definition \textit{uninferable}.

However, there are some jump targets which can still be deduced from the binary executable by considering pairs of instructions
even though by the above definition they are classified as uninferable.  Specifically, jump targets that are supplied via

\begin{itemize}
  \item an \textbf{\textit{lui}} or \textbf{\textit{c.lui}} (a register which contains a constant), or
  \item an \textbf{\textit{auipc}} (a register which contains a constant offset from the PC).
\end{itemize}

Such jump targets are classified as \textit{sequentially inferable} if the pair of instructions are retired consecutively 
(i.e. the \textbf{\textit{auipc}}, \textbf{\textit{lui}} or \textbf{\textit{c.lui}} immediately precedes the jump).  Note:
the restriction that the instructions are retired consecutively is necessary in order to minimize the additional signalling
needed between the hart and the encoder, and should have a minimal impact on trace efficiency as it is anticipated that
consecutive execution will be the norm. Support for sequentially inferable jumps is optional.

Jumps may optionally be further classified according to the recommended calling convention:

\begin{itemize}
  \item \textit{Calls}: 
    \begin{itemize}
      \item \textbf{\textit{jal}} x1;
      \item \textbf{\textit{jal}} x5;
      \item \textbf{\textit{jalr}} x1, rs  where rs != x5;
      \item \textbf{\textit{jalr}} x5, rs  where rs != x1;
      \item \textbf{\textit{c.jalr}} rs1 where rs1 != x5;
      \item \textbf{\textit{c.jal}}.
    \end{itemize}
  \item \textit{Tail-calls}: 
    \begin{itemize}
      \item \textbf{\textit{jal}} x0;
      \item \textbf{\textit{c.j}};
      \item \textbf{\textit{jalr}} x0, rs where rs != x1 and rs != x5;
      \item \textbf{\textit{c.jr}} rs1 where rs1 != x1 and rs1 != x5.
    \end{itemize}
  \item \textit{Returns}: 
    \begin{itemize}
      \item \textbf{\textit{jalr}} rd, rs where (rs == x1 or rs == x5) and rd != x1 and rd != x5;
      \item \textbf{\textit{c.jr}} rs1 where rs1 == x1 or rs1 == x5.
    \end{itemize}
  \item \textit{Co-routine swap}: 
    \begin{itemize}
      \item \textbf{\textit{jalr}} x1, x5;
      \item \textbf{\textit{jalr}} x5, x1;
      \item \textbf{\textit{c.jalr}} x5.
    \end{itemize}
  \item \textit{Other}: 
    \begin{itemize}
      \item \textbf{\textit{jal}} rd where rd != x1 and rd != x5;
      \item \textbf{\textit{jalr}} rd, rs where rs != x1 and rs != x5 and rd != x0 and rd != x1 and rd != x5.
    \end{itemize}
\end{itemize}

\subsection{Relationship between RISC-V core and the encoder} \label{sec:relationship}

The encoder is intended to encode the instructions executed on a single hart.  

It is however commonplace for a RISC-V core to contain multiple harts.  This can be 
supported by the core in several different ways:

\begin{itemize}
  \item Implement a separate instance of the interface per hart.  Each instance can be connected
    to a separate encoder instance, allowing all harts to be traced concurrently.  Alternatively,
    external muxing may be used in conjunction with a single encoder in order to trace one particular 
    hart at a time;
  \item Implement a singe interface for the core, with muxing inside the core to select which hart to 
  connect to the interface.
\end{itemize}

(Whilst it is technically feasible to use a single encoder with multiple harts operating 
in a fine-grained multi-threaded configuration, the frequent context changes that would occur
as a result of thread-switching would result in extremely poor encoding efficiency, and so
this configuration is not recommended.)

\section{Instruction Trace Interface}\label{sec:InstructionTraceInterface}
This section describes the interface between a RISC-V hart and the
trace encoder that conveys the information described in the section ~\ref{sec:InstructionInterfaceRequirements}.  
Signals are assigned to one of the  following groups:

\begin{itemize}
  \item M: Mandatory.  The interface must include an instance of this signal.
  \item O: Optional.  The interface may include an instance of this signal.
  \item MR: Mandatory, may be replicated. For harts that can retire a maximum of N "special" instructions per clock
    cycle, the interface must include N instances of this signal.
  \item OR: Optional, may be replicated. For harts that can retire a maximum of N "special" per clock
    cycle, the interface must include zero or N instances of this signal.
  \item BR: Block, may be replicated.  Mandatory for harts that can retire multiple instructions in a block.  
    Replication as per OR.  If omitted, the interface must include SR group signals instead.
  \item SR: Single, may be replicated.  Mandatory for harts that can only retire one instruction in a block.  
    Replication as per OR (see section~\ref{sec:alt-multi}).  If omitted, the interface must include BR 
    group signals instead.
\end{itemize}
"Special" instructions are those that require \textbf{itype} to be non-zero.

\begin{table}[htp]
    \centering
    \caption{Instruction interface signals}
    \label{tab:common-ingress}
    \begin{tabulary}{\textwidth}{|l|l|p{80mm}|}
        \hline
        \textbf{Signal} & \textbf{Group} & \textbf{Function} \\
        \hline
        \textbf{itype}[\textit{itype\_width\_p}-1:0] & MR & Termination type of the instruction block.  Encoding given in Table~\ref{tab:itype}
         (see Section~\ref{Jump Classes} for definitions of codes 6 - 15).\\
        \hline
        \textbf{cause}[\textit{ecause\_width\_p}-1:0] & M & Exception or interrupt cause (\textbf{\textit{ucause/scause/ vscause/mcause}}).
        Ignored unless \textbf {itype}=1 or 2.\\
        \hline
        \textbf{tval}[\textit{iaddress\_width\_p}-1:0] & M & The associated trap value, e.g. the
        faulting virtual address for address exceptions, as would be
        written to the \textbf{utval/stval/vstval/mtval} CSR. Future optional extensions may define \textbf{tval} 
        to provide ancillary information in cases where it currently supplies zero.
        Ignored unless \textbf{itype}=1.\\
        \hline
        \textbf{priv}[\textit{privilege\_width\_p}-1:0] & M & Privilege level for all instructions retired on this cycle.
        Encoding given in Table~\ref{tab:priv}.  Codes 4-7 optional.\\
        \hline
        \textbf{iaddr}[\textit{iaddress\_width\_p}-1:0] & MR & The address of the 1st instruction retired in this block.
        Invalid if \textbf{iretire}=0 unless \textbf{itype}=1, in which case it indicates the address of the 
        instruction which incurred the exception. \\
        \hline
        \textbf{context}[\textit{context\_width\_p}-1:0] & O & Context for all instructions retired on this cycle.\\
        \hline
        \textbf{time}[\textit{time\_width\_p}-1:0] & O & Time generated by the core.\\
        \hline
        \textbf{ctype}[\textit{ctype\_width\_p}-1:0] & O & Reporting behavior for \textbf{context}.  
        Encoding given in Table~\ref{tab:context-type}.  Codes 2-3 optional.\\
        \hline
        \textbf{sijump} & OR & If \textbf{itype} indicates that this block ends with an uninferable discontinuity, setting this signal to 1 
        indicates that it is sequentially inferable and may be treated as inferable by the encoder if the preceding 
        \textbf{\textit{auipc}}, \textbf{\textit{lui}} or \textbf{\textit{c.lui}} has been traced.  
        Ignored for \textbf{itype} codes other than 6, 8, 10, 12 or 14.\\
        \hline
    \end{tabulary}
\end{table}

\begin{table}[htp]
    \centering
    \caption{Instruction interface signals - multiple retirement per block}
    \label{tab:multi-ingress}
    \begin{tabulary}{\textwidth}{|l|l|p{80mm}|}
        \hline
        \textbf{Signal} & \textbf{Group} & \textbf{Function} \\
        \hline
        \textbf{iretire}[\textit{iretire\_width\_p}-1:0] & BR & Number of halfwords represented by instructions retired in this block.\\
        \hline
        \textbf{ilastsize}[\textit{ilastsize\_width\_p}-1:0] & BR & The size of the last retired instruction is 2\textsuperscript{\textbf{ilastsize}} half-words.\\
        \hline
    \end{tabulary}
\end{table}

\begin{table}[htp]
    \centering
    \caption{Instruction interface signals - single retirement per block}
    \label{tab:single-ingress}
    \begin{tabulary}{\textwidth}{|l|l|p{80mm}|}
        \hline
        \textbf{Signal} & \textbf{Group} & \textbf{Function} \\
        \hline
        \textbf{iretire}[0:0] & SR & Number of instructions retired in this block (0 or 1).\\
        \hline
        \textbf{ilastsize}[\textit{ilastsize\_width\_p-1}:0] & SR & The size of the retired instruction is 2\textsuperscript{\textbf{ilastsize}} half-words.\\
        \hline
    \end{tabulary}
\end{table}

\begin{table}[htp]
    \centering
    \caption{Instruction Type (\textbf{itype}) encoding}
    \label{tab:itype}
    \begin{tabulary}{\textwidth}{|l|p{140mm}|}
      \hline
      \textbf{Value} & \textbf{Description} \\
      \hline
      0  & Final instruction in the block is none of the other named \textbf{itype} codes\\
      1  & Exception. An exception that traps occurred following the final retired instruction in the block\\
      2  & Interrupt. An interrupt that traps occurred following the final retired instruction in the block\\
      3  & Exception or interrupt return\\
      4  & Nontaken branch\\
      5  & Taken branch\\
      6  & Uninferable jump if \textit{itype\_width\_p} is 3, reserved otherwise\\
      7  & reserved\\
      8  & Uninferable call\\
      9  & Inferrable call\\
      10 & Uninferable tail-call\\
      11 & Inferrable tail-call\\
      12 & Co-routine swap\\
      13 & Return\\
      14 & Other uninferable jump\\
      15 & Other inferable jump\\
      \hline
    \end{tabulary}
\end{table}

\begin{table}[htp]
    \centering
    \caption{Privilege level (\textbf{priv}) encoding}
    \label{tab:priv}
    \begin{tabulary}{\textwidth}{|l|p{140mm}|}
      \hline
      \textbf{Value} & \textbf{Description} \\
      \hline
        0 & U\\
        1 & S/HS\\
        2 & reserved\\
        3 & M\\
        4 & D (debug mode)\\
        5 & VU\\
        6 & VS\\
        7 & reserved\\
      \hline
    \end{tabulary}
\end{table}

Tables~\ref{tab:common-ingress} and ~\ref{tab:multi-ingress}
list the signals in the interface designed to efficiently support retirement of multiple 
instructions per cycle.  The following discussion describes the multiple-retirement behavior.  
However, for harts that can only retire one instruction at a time, the signalling can be 
simplified, and this is discussed subsequently in Section~\ref{sec:single-retire}.  

The information presented in a block represents a contiguous
block of instructions starting at \textbf{iaddr}, all of which retired
in the same cycle. Note if \textbf{itype} is 1 or 2 (indicating an
exception or an interrupt), the number of instructions retired may be
zero. \textbf{cause} and \textbf{tval} are only defined if
\textbf{itype} is 1 or 2. If \textbf{iretire}=0 and \textbf{itype}=0,
the values of all other signals are undefined.

\textbf{iretire} contains the number of (16-bit) half-words represented by
instructions retired in this block, and \textbf{ilastsize} the size of the last
instruction.  Half-words rather than instruction count enables the encoder to easily compute the address of
the last instruction in the block without having access to the size of every 
instruction in the block.  

\textbf{itype} can be 3 or 4 bits wide.  If \textit{itype\_width\_p} is 3, a single code (6) is used to 
indicate all uninferable jumps.  This is simpler to implement, but precludes use of the implicit return mode
(see section~\ref{sec:implicit-return}), which requires jump types to be fully classified.

Whilst \textbf{iaddr} is typically a virtual address, it does not affect the encoder's behavior
if it is a physical address.

For harts that can retire a maximum of N non-zero \textbf{itype} values per clock
cycle, the signal groups MR, OR and either BR or SR must be replicated N times.  
Typically N is determined by the maximum number of branches that can be retired per clock cycle.
Signal group 0 represents information about the oldest instruction block, and group N-1
represents the newest instruction block. The interface supports no more
than one privilege change, context change, 
exception or interrupt per cycle and so signals in
groups M and O are not replicated. Furthermore, \textbf{itype} can only
take the value 1 or 2 in one of the signal groups, and this must be
the newest valid group (i.e. \textbf{iretire} and \textbf{itype} must
be zero for higher numbered groups). If fewer than N groups are required 
in a cycle, then lower numbered groups must be used
first. For example, if there is one branch, use only group 0, if
there are two branches, instructions up to the 1st branch
must be reported in group 0 and instructions up to the 2nd branch
must be reported in group 1 and so on.

\textbf{sijump} is optional and may be omitted if the hart does not implement the logic to detect
sequentially inferable jumps.  If the encoder offers an \textbf{sijump} input it must also provide a
parameter to indicate whether the input is connected to a hart that implements this capability, or
tied off.  This is to ensure the decoder can be made aware of the hart's capability.  Enabling 
sequentially inferable jump mode in the encoder and decoder when the hart does not support it will
prevent correct reconstruction by the decoder. 

The \textbf{context} and/or the \textbf{time} field can be used to convey any additional information to the decoder.  For example:

\begin{itemize}
  \item The address space and virtual machine IDs (\textbf{ASID} and \textbf{VMID}
  respectively).  Where present it is recommended these values be wired to bits [15:0] and [29:16];
  \item The software thread ID;
  \item The process ID from an operating system;
  \item It could be used to convey the values of CSRs to the decoder by setting \textbf{context} to the 
    CSR number and value when a CSR is written;
  \item In cases where a single encoder is being shared amongst multiple harts 
  (see section~\ref{sec:relationship}), it could also be used to indicate the hart ID, in cases where the 
  hart ID can be changed dynamically.
\item Time from within the hart
\end{itemize}

Table~\ref{tab:context-type} specifies the actions for the various
\textbf{ctype} values.  A typical behavior would be for this signal to
remain zero except on the 1st retirement after a context change
or when a time value should be reported.
\textit{ctype\_width\_p} may be 1 or 2.  The reduced width option only
provides support for reporting context changes imprecisely.

\begin{table}[htp]
    \centering
    \caption{Context type \textbf{ctype} values and corresponding actions}
    \label{tab:context-type}
    \begin{tabulary}{\textwidth}{|l|l|p{90mm}|}
        \hline
        \textbf{Type} & \textbf{Value} & \textbf{Actions} \\
        \hline
        Unreported & 0 & No action (don't report context).\\
        \hline
        Report context imprecisely & 1 & An example would be a SW thread or operating system process change.\newline
        Report the new context value at the earliest convenient opportunity.\newline
        It is reported without any address information, and the assumption is that the precise 
        point of context change can be deduced from the source code (e.g. a CSR write). \\
        \hline
        Report context precisely  & 2 & Report the address of the 1st instruction retired in this block, and the new context.\newline
        If there were unreported branches beforehand, these need to be reported first.\newline
        Treated the same as a privilege change.\\
        \hline
        Report context as an & 3 &  An example would be a change of hart.\\
        asynchronous discontinuity & & 
        Need to report the last instruction retired on the previous context, as well as the 1st on the new context.\newline
        Treated the same as an exception.\\
        \hline
    \end{tabulary}
\end{table}

\subsection{Simplifications for single-retirement} \label{sec:single-retire}

For harts that can only retire one instruction at a time, the interface can be simplified to
the signals listed in tables~\ref{tab:common-ingress} and ~\ref{tab:single-ingress}.  The 
simplifications can be summarized as follows:
 
\begin{itemize}
  \item \textbf{iretire} simply indicates whether an instruction retired or not;
\end{itemize}

\textbf{Note:} \textbf{ilastsize} is still needed in order to determine the address of the next 
instruction, as this is the predicted return address for implicit return mode 
(see Section~\ref{sec:implicit-return}).

The parameter \textit{retires\_p} which indicates to the encoder the maximum number of 
instructions that can be retired per cycle can be used by an encoder capable of supporting single or 
multiple retirement to select the appropriate interpretation of \textbf{iretire}.  


\subsection{Alternative multiple-retirement interface configurations} \label{sec:alt-multi}

For a hart that can retire multiple instructions per cycle, but no more than one branch, the preferred 
solution is to use one instance of signals from groups BR, MR and OR.  However, if the hart can retire 
N branches in a cycle, N instances of signals from groups MR, OR and either SR or BR must be used 
(each instance can be either a single instruction or a block).

If the hart can retire N instructions per cycle, but only one branch, it is allowed (though not recommended)
to provide explicit details of every instruction retired by using N instances of signals from groups
SR, MR and OR.

\subsection{Optional sideband signals}

Optional sideband signals may be included to provide additional functionality, as described in
tables \ref{tab:ingress-side-band} and \ref{tab:egress-side-band}.

Note, any user defined information that needs to be output by the encoder
will need to be applied via the \textbf{context} input.

\begin{table}[htp]
    \centering
    \caption{Optional sideband encoder input signals}
    \label{tab:ingress-side-band}
    \begin{tabulary}{\textwidth}{|l|p{18mm}|p{85mm}|}
        \hline
        \textbf{Signal} & \textbf{Group} & \textbf{Function} \\
        \hline
        \textbf{impdef}[\textit{impdef\_width\_p}-1:0] & O &  Implementation defined sideband signals.  A typical use for
        these would be for filtering (see Chapter~\ref{ch:filtering}.\\
        \hline
        \textbf{trigger}[2+:0] & [1:0]: O \newline 
                                [2+]:  OR & A pulse on bit 0 will cause the encoder to start tracing, and continue until further 
        notice, subject to other filtering criteria also being met.\newline
        A pulse on bit 1 will cause the encoder to stop tracing until further notice.  See section~\ref{sec:trigger}).\\
        \hline
        \textbf{halted} & O & Hart is halted.  Upon assertion, the encoder will output a packet to report the address 
        of the last instruction retired before halting, followed by a support packet to indicate that tracing has stopped. 
        Upon deassertion, the encoder will start tracing again, commencing with a synchronization packet.
        \textbf{Note:} If this signal is not provided, it is strongly recommended that Debug mode can be signalled via a 3-bit 
          \textbf{privilege} signal.  This will allow tracing in Debug mode to be controlled via the optional filtering capabilities.\\
        \hline
        \textbf{reset} & O & Hart is in reset.  Provided the encoder is in a different reset domain to the hart, this
        allows the encoder to indicate that tracing has ended on entry to reset, and restarted on exit.  
        Behavior is as described above for halt.\\
        \hline
    \end{tabulary}
\end{table}

\begin{table}[htp]
    \centering
    \caption{Optional sideband encoder output signals}
    \label{tab:egress-side-band}
    \begin{tabulary}{\textwidth}{|l|l|p{125mm}|}
        \hline
        \textbf{Signal} & \textbf{Group} & \textbf{Function} \\
        \hline
        \textbf{stall} & O & Stall request to hart.  Some applications may require lossless trace, which can be achieved by
        using this signal to stall the hart if the trace encoder is unable to output a trace packet (for example due to 
        back-pressure from the packet transport infrastructure).\\
        \hline
    \end{tabulary}
\end{table}

\FloatBarrier
\subsection{Using trigger outputs from the Debug Module} \label{sec:trigger}

The debug module of the RISC-V hart may have a trigger unit. This defines a match control register
(\textbf{\textit{mcontrol}}) containing a 4-bit \textbf{action} field, and reserves codes 2 - 5 
of this field for trace use.  
These action codes are hereby defined as shown in table \ref{tab:debugModuleTriggerSupport}.
If implemented, each action must generate a pulse on an output from the hart, on the same cycle as the instruction which
caused the trigger is retired.  

\begin{table}[!h]
    \centering
    \caption{Debug Module trigger support (\textbf{\textit{mcontrol}} \textbf{action})}
    \label{tab:debugModuleTriggerSupport}
    \begin{tabulary}{\textwidth}{|l|p{140mm}|}
        \hline
        \textbf {Value} & \textbf {Description} \\
        \hline
        2 & \textit{Trace-on}.  This should be connected to 
        \textbf{trigger[0]} if the encoder provides it. \\ 
        \hline
        3 & \textit{Trace-off}.  This should be connected to 
        \textbf{trigger[1]} if the encoder provides it. \\
        \hline
        4 & \textit{Trace-notify}.  This should be connected to 
        \textbf{trigger[1 + \textit{blocks}:}2] if the encoder provides it. This will cause the encoder to output
        a packet containing the address of the last instruction in the block if it is enabled.
        One bit per block.\\
        \hline
    \end{tabulary}
\end{table}

Trace-on and Trace-off actions provide a means for the hart to control when tracing
starts and stops.  It is recommended that tracing starts from the oldest  
instruction retired in the cycle that Trace-on is asserted, and stops following the newest
instruction retired in the cycle that Trace-off is asserted (subject to any optional filtering).

Trace-notify provides means to ensure that a specified instruction is explicitly reported 
(subject to any optional filtering). This capability is sometimes known as a watchpoint.

\FloatBarrier
\subsection{Example retirement sequences}

\begin{table}[htp]
    \centering
    \caption{Example 1 : 9 Instructions retired over four cycles, 2 branches} 
    \label{tab:signal-block-9-instructions-2-branches}
    \begin{tabulary}{\textwidth}{|l|l|}
        \hline
        \textbf {Retired} & \textbf {Instruction Trace Block} \\
        \hline
        1000: \textbf{\textit{divuw}} &  \textbf{iretire}=7, \textbf{iaddr}=0x1000, \textbf{itype}=8\\
        1004: \textbf{\textit{add}} &  \\
        1008: \textbf{\textit{or}} &  \\
        100C: \textbf{\textit{c.jalr}} &  \\
        \hline
        0940: \textbf{\textit{addi}} &  \textbf{iretire}=3, \textbf{iaddr}=0x0940, \textbf{itype}=4\\
        0944: \textbf{\textit{c.beq}} &  \\
        \hline
        0946: \textbf{\textit{c.bnez}} &  \textbf{iretire}=1, \textbf{iaddr}=0x0946, \textbf{itype}=5\\
        \hline
        0988: \textbf{\textit{lbu}} &  \textbf{iretire}=4, \textbf{iaddr}=0x0988, \textbf{itype}=0\\
        098C: \textbf{\textit{csrrw}} &  \\
        \hline
    \end{tabulary}
\end{table}

\section{Data Trace Interface requirements}\label{sec:DataInterfaceRequirements}

This section describes in general terms the information which must be
passed from the RISC-V hart to the trace encoder for the purposes of
Data Trace, and distinguishes between what is mandatory, and what is
optional.

If Data Trace is not needed in a system then there is no requirement
for the RISC-V hart to supply any of the signals in section
~\ref{sec:DataTraceInterface}.

Data trace supports up to four data access types: load, store, atomic and CSR.  
Support for both atomic and CSR accesses are independently optional.

The signalling protocol can take one of two forms, depending on the needs of the RISC-V 
hart: \textit{unified} or \textit{split}.

Unified is the simplest form, suitable for simpler, in-order harts.  In this form, all
information about a data access is signalled by the RISC-V hart in the
same cycle that the associated data access instruction is reported on the instruction
trace interface.  

For harts with out of order or speculative execution capabilities, many loads may be in 
progress simultaneously, and this approach is not practical as it would require the hart 
to maintain a large amount of state relating to all the in-progress loads.  For this 
reason, the interface also supports splitting loads into two parts:

\begin{itemize}
  \item The \textit{request} phase provides all the information about the load that
    originates from the hart (address, size, etc.) when the instruction retires;
  \item The \textit{response} phase provides the data and response status when it has
    been returned to the hart from the memory system.
\end{itemize}

The two parts of a split load are associated by use of a transaction ID.

\textcolor{red}{The push and pop instructions proposed by the code size working group
will each result in multiple loads or stores.  To allow the resulting loads or stores to be
associated with the correct instruction, the push or pop must be reported 
on the instruction trace interface multiple times (once for each individual load or store)
using \textbf{itype} 0 except for the final load resulting from a popret instruction which 
must use \textbf{itype} 13 to signal the return}.  
If an exception
occurs part way through the sequence of loads or stores initiated by such an instruction,
and the instruction is re-executed after the exception handler has been serviced, the
load or store sequence must recommence from the beginning.
If an exception
occurs part way through the sequence of loads or stores initiated by such an instruction,
and the instruction is re-executed after the exception handler has been serviced, the
load or store sequence must recommence from the beginning.

\section{Data Trace Interface}\label{sec:DataTraceInterface}

This section describes the interface between a RISC-V hart and the
trace encoder that conveys the information described in the
section~\ref{sec:DataInterfaceRequirements}.  Signals are assigned to
one of the following groups:

\begin{itemize}
  \item M: Mandatory.  The interface must include an instance of this signal;
  \item U: Unified.  Mandatory for unified signalling;
  \item S: Split.  Mandatory for split load signalling;
  \item O: Optional.  The interface may include an instance of this signal.
\end{itemize}

\begin{table}[htp]
    \centering
    \caption{Data interface signals}
    \label{tab:data-ingress}
    \begin{tabulary}{\textwidth}{|l|l|p{80mm}|}
        \hline
        \textbf{Signal} & \textbf{Group} & \textbf{Function} \\
        \hline
        \textbf{dretire} & M & Data access retired (when high)\\
        \hline
        \textbf{dtype}[\textit{dtype\_width\_p}-1:0] & M & Data access type.  Encoding given in Table~\ref{tab:dtype}\\
        \hline
        \textbf{daddr}[\textit{daddress\_width\_p}-1:0] & M & The data access address\\
        \hline
        \textbf{dsize}[\textit{dsize\_width\_p}-1:0] & M & The data access size is 2\textsuperscript{\textbf{dsize}} bytes\\
        \hline
        \textbf{data}[\textit{data\_width\_p}-1:0] & U & The data\\
        \hline
        \textbf{iaddr\_lsbs}[\textit{iaddr\_lsbs\_width\_p}-1:0] & O & LSBs of the data access instruction address.  Required if \textit{retires\_p} > 1\\
        \hline
        \textbf{dblock}[\textit{dblock\_width\_p}-1:0] & O & Instruction block in which the data access instruction is retired.  Required if there 
          are replicated instruction block signals\\
        \hline
        \textbf{lrid}[\textit{lrid\_width\_p}-1:0] & S & Load request ID. Valid when \textbf{dretire} is high\\
        \hline        
        \textbf{lresp}[\textit{lresp\_width\_p}-1:0] & S & Load response:\newline
        0: None\newline
        1: reserved\newline
        2: Okay.  Load successful; \textbf{ldata} valid\newline
        3: Error. Load failed; \textbf{ldata} not valid\\
        \hline        
        \textbf{lid}[\textit{lrid\_width\_p}-1:0] & S & Split Load ID. Valid when \textbf{lresp} is non-zero\\
        \hline        
        \textbf{sdata}[\textit{sdata\_width\_p}-1:0] & S & Store data. Valid when \textbf{dretire} is high\\
        \hline        
        \textbf{ldata}[\textit{ldata\_width\_p}-1:0] & S & Load data. Valid when \textbf{lresp} is non-zero\\
        \hline        
    \end{tabulary}
\end{table}

\begin{table}[htp]
    \centering
    \caption{Data access type (\textbf{dtype}) encoding}
    \label{tab:dtype}
    \begin{tabulary}{\textwidth}{|l|p{140mm}|}
      \hline
      \textbf{Value} & \textbf{Description} \\
        \hline
        0  & Load\\
        1  & Store\\
        2  & reserved\\
        3  & reserved\\
        4  & CSR read-write\\
        5  & CSR read-set\\
        6  & CSR read-clear\\
        7  & reserved\\
        8  & Atomic swap\\
        9  & Atomic add\\
        10 & Atomic AND\\
        11 & Atomic OR\\
        12 & Atomic XOR\\
        13 & Atomic max\\
        14 & Atomic min\\
        15 & Conditional store failure\\
      \hline
    \end{tabulary}
\end{table}

All signals in M, U and O groups are only valid when \textbf{dretire} is high.  Signals in the S group are valid as
indicated in table \ref{tab:data-ingress}.  

For harts that can retire a maximum of M data accesses per cycle, the implemented signal
groups must be replicated M times.  If fewer than M groups are required in a cycle, then lower numbered groups must 
be used first. For example, if there is one data access, use only group 0.

The maximum value of \textit{dtype\_width\_p} is 4.  However, if only loads and stores are supported, 
\textit{dtype\_width\_p} can be 1.  If CSRs are supported but atomics are not, \textit{dtype\_width\_p} can be 3.

Atomic and CSR accesses have either both load and store data, or store data and an operand.  For CSRs and unified
atomics, both values are reported via \textbf{data}, with the store data in the LSBs and the load data or operand
in the MSBs.

\textit{lrid\_width\_p} is determined by the maximum number of loads that can be in progress simultaneously, such
that at any one time there can be no more than one load in progress with a given ID.

\textbf{iaddr\_lsbs} and \textbf{dblock} are provided to support filtering of which data accesses to trace
based on their instruction address.  This is best illustrated by considering the following instruction sequence:

\begin{enumerate}
  \item load
  \item <some non data access instruction>
  \item load
  \item <some non data access instruction>
  \item <some non data access instruction>
\end{enumerate}
 
Suppose the hart is capable of retiring up to 4 instructions in a cycle, via a single block.   Instruction trace is 
enabled throughout, but the requirement is to collect data trace for the 1st load (instruction 1), and filtering is
configured to match the address of this instruction only.  However, information about instruction addresses is passed 
to the encoder at the block level, and the block boundaries are invisible to the decoder.  For instruction trace, 
all instructions in a block are traced if any of the instructions in that block match the filtering criteria.  
That is fine for instruction trace - the address of the 1st and last traced instruction are output explicitly.  
There will be some fuzziness about precisely what those addresses will be depending on where the block boundaries 
fall, but this is not a concern as everything is always self-consistent.
 
However, that is not the case for data trace.  Consider two scenarios:

\begin{itemize}
  \item Case 1: 1st block contains instructions 1, 2, 3; second block contains 4, 5
  \item Case 2: 1st block contains instructions 1, 2; second block contains 3, 4, 5
\end{itemize}
 
Given that \textbf{iretire} is non-zero in the same cycle that the data access retires, the encoder knows the address of 
the 1st and last instructions in a block, but does not know precisely where in the block the data access is.  
In both cases, the first block matches the filtering criteria (it contains the address of instruction 1), and the second block does not.  
But if the encoder traced all the data accesses in the matching block, then in case 1 it would trace both instructions 1 and 3, whereas 
in the second case it would trace only instruction 1.  The decoder has no visibility of the block boundaries so cannot account for this.
It is expecting only instruction 1 to be traced, and so may misinterpret instruction 3.  If this code is in a loop for example, it will 
assume that the 2nd traced load is in fact instruction 1 from the next loop iteration, rather than instruction 3 from this iteration.
 
Providing the LSBs of the data access instruction address allows the decoder to determine precisely whether the data access should be 
traced or not, and removes the dependency on the block sizes and boundaries.  The number of bits required is one more bit than the 
number required to index within the block because blocks can start on any half-word boundary.
 
For harts that replicate the block signals to allow multiple blocks to retire per cycle it is also necessary to indicate which block 
each data access is associated with, so the encoder knows which block address to combine with the LSBs in order to construct the 
actual data access instruction address.  1 bit for 2 blocks per cycle, 2 bits for 4, and so on.

